code element
Evaluating the Use of LLMs for Documentation to Code Traceability
Alor, Ebube, Khatoonabadi, SayedHassan, Shihab, Emad
Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task-framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations could necessitate human-in-the-loop tool design and highlight specific error patterns for future research.
SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis
Li, Yansong, Branco, Paula, Hoole, Alexander M., Marwah, Manish, Koduvely, Hari Manassery, Jourdan, Guy-Vincent, Jou, Stephan
As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TrustEval-C, a benchmark designed to evaluate LLMs' abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning - assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning - examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TrustEval-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is publicly available.
CodePurify: Defend Backdoor Attacks on Neural Code Models via Entropy-based Purification
Mu, Fangwen, Wang, Junjie, Yu, Zhuohao, Shi, Lin, Wang, Song, Li, Mingyang, Wang, Qing
Neural code models have found widespread success in tasks pertaining to code intelligence, yet they are vulnerable to backdoor attacks, where an adversary can manipulate the victim model's behavior by inserting triggers into the source code. Recent studies indicate that advanced backdoor attacks can achieve nearly 100% attack success rates on many software engineering tasks. However, effective defense techniques against such attacks remain insufficiently explored. In this study, we propose CodePurify, a novel defense against backdoor attacks on code models through entropy-based purification. Entropy-based purification involves the process of precisely detecting and eliminating the possible triggers in the source code while preserving its semantic information. Within this process, CodePurify first develops a confidence-driven entropy-based measurement to determine whether a code snippet is poisoned and, if so, locates the triggers. Subsequently, it purifies the code by substituting the triggers with benign tokens using a masked language model. We extensively evaluate CodePurify against four advanced backdoor attacks across three representative tasks and two popular code models. The results show that CodePurify significantly outperforms four commonly used defense baselines, improving average defense performance by at least 40%, 40%, and 12% across the three tasks, respectively. These findings highlight the potential of CodePurify to serve as a robust defense against backdoor attacks on neural code models.
RAMBO: Enhancing RAG-based Repository-Level Method Body Completion
Bui, Tuan-Dung, Luu-Van, Duc-Thieu, Nguyen, Thanh-Phat, Nguyen, Thu-Trang, Nguyen, Son, Vo, Hieu Dinh
Code completion is essential in software development, helping developers by predicting code snippets based on context. Among completion tasks, Method Body Completion (MBC) is particularly challenging as it involves generating complete method bodies based on their signatures and context. This task becomes significantly harder in large repositories, where method bodies must integrate repositoryspecific elements such as custom APIs, inter-module dependencies, and project-specific conventions. In this paper, we introduce RAMBO, a novel RAG-based approach for repository-level MBC. Instead of retrieving similar method bodies, RAMBO identifies essential repository-specific elements, such as classes, methods, and variables/fields, and their relevant usages. By incorporating these elements and their relevant usages into the code generation process, RAMBO ensures more accurate and contextually relevant method bodies. Our experimental results with leading code LLMs across 40 Java projects show that RAMBO significantly outperformed the state-of-the-art repository-level MBC approaches, with the improvements of up to 46% in BLEU, 57% in CodeBLEU, 36% in Compilation Rate, and up to 3X in Exact Match. Notably, RAMBO surpassed RepoCoder Oracle method by up to 12% in Exact Match, setting a new benchmark for repository-level MBC.
Sketch Then Generate: Providing Incremental User Feedback and Guiding LLM Code Generation through Language-Oriented Code Sketches
Zhu-Tian, Chen, Xiong, Zeyu, Yao, Xiaoshuo, Glassman, Elena
Existing solutions (e.g., [11, 21, 25, Crafting effective prompts for code generation or editing with Large 30]), while useful, fall short in providing user feedback during the Language Models (LLMs) is not an easy task. Particularly, the absence prompt crafting process. The lack of such feedback can lead to of immediate, stable feedback during prompt crafting hinders uncertainty about the expected output and potential inefficiencies effective interaction, as users are left to mentally imagine possible in the development process, as users are left to mentally imagine outcomes until the code is generated. In response, we introduce possible outcomes until the code is generated [12, 29]. Language-Oriented Code Sketching, an interactive approach that provides However, providing user feedback during prompt crafting presents instant, incremental feedback in the form of code sketches (i.e., significant challenges. Ideally, feedback should not only be instant incomplete code outlines) during prompt crafting. This approach but also evolve incrementally as the user types their prompt, offering converts a prompt into a code sketch by leveraging the inherent clear and understandable intermediate results that progressively linguistic structures within the prompt and applying classic natural guide the LLM towards the final desired code.
Predicting Line-Level Defects by Capturing Code Contexts with Hierarchical Transformers
Mahbub, Parvez, Rahman, Mohammad Masudur
Software defects consume 40% of the total budget in software development and cost the global economy billions of dollars every year. Unfortunately, despite the use of many software quality assurance (SQA) practices in software development (e.g., code review, continuous integration), defects may still exist in the official release of a software product. Therefore, prioritizing SQA efforts for the vulnerable areas of the codebase is essential to ensure the high quality of a software release. Predicting software defects at the line level could help prioritize the SQA effort but is a highly challenging task given that only ~3% of lines of a codebase could be defective. Existing works on line-level defect prediction often fall short and cannot fully leverage the line-level defect information. In this paper, we propose Bugsplorer, a novel deep-learning technique for line-level defect prediction. It leverages a hierarchical structure of transformer models to represent two types of code elements: code tokens and code lines. Unlike the existing techniques that are optimized for file-level defect prediction, Bugsplorer is optimized for a line-level defect prediction objective. Our evaluation with five performance metrics shows that Bugsplorer has a promising capability of predicting defective lines with 26-72% better accuracy than that of the state-of-the-art technique. It can rank the first 20% defective lines within the top 1-3% suspicious lines. Thus, Bugsplorer has the potential to significantly reduce SQA costs by ranking defective lines higher.
TypeT5: Seq2seq Type Inference using Static Analysis
Wei, Jiayi, Durrett, Greg, Dillig, Isil
There has been growing interest in automatically predicting missing type annotations in programs written in Python and JavaScript. While prior methods have achieved impressive accuracy when predicting the most common types, they often perform poorly on rare or complex types. In this paper, we present a new type inference method that treats type prediction as a code infilling task by leveraging CodeT5, a state-of-the-art seq2seq pre-trained language model for code. Our method uses static analysis to construct dynamic contexts for each code element whose type signature is to be predicted by the model. We also propose an iterative decoding scheme that incorporates previous type predictions in the model's input context, allowing information exchange between related code elements. Our evaluation shows that the proposed approach, TypeT5, not only achieves a higher overall accuracy (particularly on rare and complex types) but also produces more coherent results with fewer type errors -- while enabling easy user intervention.
Semantic Clone Detection via Probabilistic Software Modeling
Thaller, Hannes, Linsbauer, Lukas, van Bladel, Brent, Egyed, Alexander
Semantic clone detection is the process of finding program elements with similar or equal runtime behavior. For example, detecting the semantic equality between the recursive and iterative implementation of the factorial computation. Semantic clone detection is the de facto technical boundary of clone detectors. This boundary was tested over the last years with interesting new approaches. This work contributes a semantic clone detection approach that detects clones with 0% syntactic similarity. We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM) as a stable and precise solution to semantic clone detection. PSM builds a probabilistic model of a program that is capable of evaluating and generating runtime data. SCD-PSM leverages this model and its model elements to finding behaviorally equal model elements. This behavioral equality is then generalized to semantic equality of the original program elements. It uses the likelihood between model elements as a distance metric. Then, it employs the likelihood ratio significance test to decide whether this distance is significant, given a pre-specified and controllable false-positive rate. The output of SCD-PSM are pairs of program elements (i.e., methods), their distance, and a decision whether they are clones or not. SCD-PSM yields excellent results with a Matthews Correlation Coefficient greater 0.9. These results are obtained on classical semantic clone detection problems such as detecting recursive and iterative versions of an algorithm, but also on complex problems used in coding competitions.